perm filename MANGAL.TEX[TEX,DEK]1 blob sn#371074 filedate 1978-07-31 generic text, type C, neo UTF8
\input manhdr 
\titlepage
\tenpoint
\vfill
\ctrline{\:<TAU EPSILON CHI, a system for technical text}
\ctrline{$\copyright$ 1978 by D. E. Knuth}
\ctrline{(sample galleys)}
\vfill
\gdef\chead{Galleys}
\manmark{\chead}{\chead}
\setcpage 0
\eject
\specialappbegin H. {Hyphenation}
\ninepoint
The conditions under which \TEX\ will try to hyphenate a word are discussed
in Chapter 14. Now let's consider how hyphenation is actually accomplished.

\def\.#1{\hjust{\tt#1}}
It seems to be undesirable to look for the set of all possible places to hyphenate
every given word. For one thing,
the problem is extremely difficult, since the word ``\.{record}'' is supposed to
be broken as ``\.{rec-ord}'' when it is a noun but ``\.{re-cord}''
when it is a verb. We might consider also
the word ``\.{hyphenation}'' itself, which appears to be rather an exception:
$$\hjust{\.{hy-phen-a-tion}\quad vs.\quad\.{con-cat-e-na-tion}}\quad.$$
Why does the ``\.n'' go with the ``\.a'' in one case and not the other?
Starting at letter
\.a in the dictionary and trying to find rigorous rules for hyphenation without
much knowledge, we come up against \.{a-part} vs.\ \.{ap-er-ture},
\.{aph-o-rism} vs.\ \.{a-pha-sia},
etc. It becomes clear that what we want is not an accurate but ponderously slow
routine that consumes a lot of memory space and processing time; instead we want
a set of hyphenation rules that are
$$\vjust{\halign{\hfill# ⊗#\hfill\cr
a)⊗simple enough to explain in a couple of pages,\cr
\noalign{\vskip 3pt}
b)⊗almost always safe,\cr
\noalign{\vskip 3pt}
and\quad c)⊗powerful enough that bad breaks due to
missed hyphenations are very rare.\cr}}$$
Point (c) means that a proofreader's job should be only negligibly more difficult
than it would be if an intelligent human being were doing all of the hyphenations
needed to typeset the same material.

So here are the rules \TEX\ uses (found with the help of Frank Liang):

\yskip\textindent{\sl 1)}{\sl Exception removal.}\xskip If
the first seven letters of the word appear in a small internal dictionary
of words to be treated specially (about 350 words in all, see below), use the
hyphenation found in that dictionary. Furthermore some of the entries in the
dictionary specify looking at more than seven letters to make sure that
the exception is real; e.g., ``\.{in-form-ant}'' wouldn't be distinguished from the
unexceptional word ``\.{in-for-ma-tion}'' on the basis of seven
letters alone. If the given
word has seven letters or fewer and ends with ``\.s'', the word minus the \.s is also
looked up. The dictionary contains nearly all the common English words for
which the rules below would make an incorrect break, plus additional words
that are common in computer science writing and whose breaks are not satisfactorily
found by the rules.
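
The lookup in rule 1 can be sketched as follows. This is only a minimal
illustration, not the routine used by the program: the four-entry table and the
names parse_entry, EXCEPTIONS and lookup_exception are hypothetical, and the
extra-letter checks that some entries require are not modeled.

```python
def parse_entry(entry):
    """Split an entry such as 'dar-ling' into the bare word and its break positions."""
    word, breaks = "", []
    for ch in entry:
        if ch == "-":
            breaks.append(len(word))  # a hyphen may precede the letter at this index
        else:
            word += ch
    return word, breaks

# Hypothetical miniature table, keyed by the first seven letters of each entry.
EXCEPTIONS = {}
for entry in ["dar-ling", "pal-ate", "met-al", "al-go-rithm"]:
    bare, positions = parse_entry(entry)
    EXCEPTIONS[bare[:7]] = positions

def lookup_exception(word):
    """Return the stored break positions, or None when rule 1 does not apply."""
    if word[:7] in EXCEPTIONS:
        return EXCEPTIONS[word[:7]]
    # A word of seven letters or fewer that ends in s is retried without the s.
    if len(word) <= 7 and word.endswith("s") and word[:-1] in EXCEPTIONS:
        return EXCEPTIONS[word[:-1]]
    return None
```

Break positions are stored as letter offsets, so the same list serves the
singular and the plural form of a word.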

\yskip\textindent{\sl 2)}{\sl Suffix
removal.}\xskip A permissible hyphen is inserted if the word ends with
\.{-able} (preceded by \.e, \.h, \.i, \.k, \.l, \.o, \.u, \.v, \.w, \.x, \.y or
\.{nt} or \.{rt}), \.{-ary} (preceded by \.{ion} or
\.{en}), \.{-cal}, \.{-cate} (preceded by a vowel), \.{-cial}, \.{-cious} (unless
preceded by \.s),
\.{-cient}, \.{-dent}, \.{-ful}, \.{-ize} (preceded by \.l), \.{-late} (preceded 
by a vowel), \.{-less}, \.{-ly},
\.{-ment}, \.{-ness}, \.{-nary} (unless preceded by \.e or \.{io}), \.{-ogy},
\.{-rapher}, \.{-raphy},
\.{-scious}, \.{-scope}, \.{-scopic}, \.{-sion}, \.{-sphere}, \.{-tal}, \.{-tial},
\.{-tion}, \.{-tion-al}, \.{-tive},
\.{-ture}. Here a ``vowel'' is either \.a, \.e, \.i, \.o, \.u, or \.y;
the other 20 letters are ``consonants.''
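
A table-driven sketch of a single suffix test, covering only a handful of the
suffixes just listed. The names SUFFIX_RULES, ends_in_vowel and suffix_break
are hypothetical, and the order of the table is an assumption; the text does
not say in what order the suffixes are tried.

```python
VOWELS = set("aeiouy")

def ends_in_vowel(stem):
    """True if the last letter before the suffix is a, e, i, o, u or y."""
    return stem[-1:] in VOWELS

# A few of the suffixes above, each with a test applied to the letters before it.
SUFFIX_RULES = [
    ("able", lambda stem: stem.endswith(tuple("ehiklouvwxy") + ("nt", "rt"))),
    ("ary", lambda stem: stem.endswith(("ion", "en"))),
    ("cate", ends_in_vowel),
    ("cious", lambda stem: not stem.endswith("s")),
    ("ize", lambda stem: stem.endswith("l")),
    ("late", ends_in_vowel),
    ("nary", lambda stem: not stem.endswith(("e", "io"))),
    ("ful", lambda stem: True),
    ("less", lambda stem: True),
    ("ment", lambda stem: True),
    ("tion", lambda stem: True),
]

def suffix_break(word):
    """Return the offset of a permissible hyphen before a suffix, or None."""
    for suffix, allowed in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            if allowed(stem):
                return len(stem)
    return None
```

Under these assumptions ``stationary'' is split before the \.{-ary} and
``binary'' before the \.{-nary}; the two tests never both succeed, because the
first requires a preceding \.{ion} or \.{en} and the second forbids a preceding
\.e or \.{io}.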

There is also a somewhat more complex rule for words ending with ``\.{ing}'':
If \.{ing} is preceded by fewer than four letters, insert no permissible hyphens.
Otherwise if \.{ing} is preceded by two identical consonants other than \.f, \.l,
\.s, or
\.z, break between them.  Otherwise if it is preceded by a letter other than \.l,
break the \.{-ing}. Otherwise if the letter before \.{ling} is \.b, \.c, \.d, \.f,
\.g, \.k, \.p, \.t, or  \.z,
break before this letter (except break \.{ck-ling} if the word ends with
\.{ckling}). Otherwise break \.{-ing}.
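
The chain of ``otherwise'' cases for \.{ing} reads naturally as a cascade of
tests. The following sketch follows the wording above step by step; the
function name ing_break is hypothetical, and the result is the letter offset of
the single break that is inserted.

```python
VOWELS = set("aeiouy")

def ing_break(word):
    """Return the offset of the break for a word ending in 'ing', or None."""
    if not word.endswith("ing"):
        return None
    stem_len = len(word) - 3
    if stem_len < 4:
        return None                      # fewer than four letters before "ing"
    a, b = word[stem_len - 2], word[stem_len - 1]
    if a == b and a not in VOWELS and a not in "flsz":
        return stem_len - 1              # between the two identical consonants
    if b != "l":
        return stem_len                  # plain -ing
    # From here on the word ends in "ling".
    if a in "bcdfgkptz":
        if word.endswith("ckling"):
            return stem_len - 1          # ck-ling
        return stem_len - 2              # before the letter that precedes "ling"
    return stem_len                      # plain -ing
```

For instance, this gives \.{run-ning}, \.{tell-ing}, \.{han-dling},
\.{tick-ling} and \.{walk-ing}, and no break at all in ``sing''.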

Furthermore the same suffix removal routine is applied to the residual word after
having successfully found the suffixes \.{-able}, \.{-ary}, \.{-ful}, \.{-ize},
\.{-less}, \.{-ly}, \.{-ment},
and \.{-ness}. If the original word ends in \.s and no suffix was found, the
final \.s is removed and the suffix routine is applied again. If
the original word ends in \.{ed} the suffix routine is applied to the word with the
final \.d removed, and (if that is unsuccessful) to the word with final \.{ed}
removed.

Any suffixes found are effectively removed from the word, and are not examined by
rules 3 and 4. If the original word ends with \.e or \.s or \.{ed}, this final
letter or pair of letters is also effectively removed.

\yskip\textindent{\sl 3)}{\sl Prefix removal.}\xskip
A permissible hyphen is inserted if the word begins with
\.{be-} (followed by \.c, \.h, \.s, or \.w), \.{com-}, \.{con-}, \.{dis-}
(unless followed by \.h or \.y),
\.{equi-} (unless followed by \.v), \.{equiv-}, \.{ex-}, \.{hand-}, \.{horse-},
\.{hy*per-}, \.{im-},
\.{in-} (but use \.{in*ter-} or \.{in*tro-} if present), \.{lex*i-}, \.{mac*ro-},
\.{math*e-},
\.{max*i-}, \.{min*i-}, \.{mul*ti-}, \.{non-}, \.{out-}, \.{over-}, \.{pseu*do-},
\.{quad-}, \.{semi-}, \.{some-},
\.{sub-}, \.{su*per-}, \.{there-}, \.{trans-} (followed by \.a, \.f, \.g, \.l, or
\.m),
\.{tri-} (followed by \.a, \.f, or \.u), \.{un*der-}, \.{un-} (unless followed by
\.{der} or \.i).
Here an asterisk denotes a second permissible hyphen to be recognized, but
only if the entire prefix appears.
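
The asterisk notation suggests a simple encoding, sketched here for a subset of
the prefixes. The names PREFIX_RULES and prefix_breaks are hypothetical; longer
patterns are listed ahead of the shorter prefixes they contain so that
\.{in*ter-} and \.{un*der-} are preferred, as the rule requires.

```python
# (pattern, test on the letter that follows the prefix); None means no condition.
PREFIX_RULES = [
    ("be", lambda nxt: nxt in "chsw"),
    ("dis", lambda nxt: nxt not in "hy"),
    ("hy*per", None),
    ("in*ter", None),
    ("in*tro", None),
    ("in", None),
    ("trans", lambda nxt: nxt in "afglm"),
    ("un*der", None),
    ("un", lambda nxt: nxt != "i"),
]

def prefix_breaks(word):
    """Return the break offsets contributed by one prefix, or [] if none applies."""
    for pattern, condition in PREFIX_RULES:
        prefix = pattern.replace("*", "")
        if not word.startswith(prefix) or len(word) == len(prefix):
            continue
        nxt = word[len(prefix)]          # the letter right after the prefix
        if condition is not None and not condition(nxt):
            continue
        breaks = []
        if "*" in pattern:
            breaks.append(pattern.index("*"))   # the second, inner hyphen
        breaks.append(len(prefix))
        return breaks
    return []
```

The exceptional words such as \.{in-te-rior} are not handled here; they are
assumed to have been caught already by rule 1.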

After the prefixes \.{dis-}, \.{im-}, \.{in-}, \.{non-}, \.{over-}, \.{un-} have
been recognized as stated, the prefix
routine is entered again. Any prefixes found are effectively removed from
the word, and not examined by rule 4.

\yskip\textindent{\sl 4)}{\sl Study of consonant pairs.}\xskip
In the remainder of the word, after suffixes and
prefixes have been removed, we combine the letter pairs \.{ch}, \.{gh}, \.{ph},
\.{sh}, \.{th},
treating them as single consonants. 

If the three-letter combination \.{XYY} is found, where \.X is a vowel and \.Y a
consonant, break between the \.Y's, except if \.Y is \.l or \.s. In the latter case,
break only if the following letter is a vowel and the word doesn't end ``\.{XYYer}''
or ``\.{XYYers}''.

If the three-letter combination \.{Xck} is found, where \.X is a vowel, break
after the \.{ck}.

If the three-letter combination \.{Xqu} is found, where \.X is a vowel, break
before the \.{qu}.
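
The three-letter tests just given can be sketched in one scan over the word.
This is a simplified illustration: the name three_letter_breaks is
hypothetical, the combined pairs \.{ch}, \.{gh}, etc.\ are ignored, and the
function is assumed to see only the remainder left by rules 2 and 3.

```python
VOWELS = set("aeiouy")

def three_letter_breaks(word):
    """Return sorted break offsets found by the XYY, Xck and Xqu tests."""
    breaks = set()
    for i in range(len(word) - 2):
        x, y, z = word[i], word[i + 1], word[i + 2]
        if x not in VOWELS:
            continue
        rest = word[i + 3:]              # whatever follows the three letters
        if y == z and y not in VOWELS:   # XYY
            if y in "ls":
                # For ll and ss a vowel must follow, and the word
                # must not end in XYYer or XYYers.
                if rest[:1] not in VOWELS or rest in ("er", "ers"):
                    continue
            breaks.add(i + 2)            # between the two Y's
        elif y + z == "ck" and rest:
            breaks.add(i + 3)            # after the ck
        elif y + z == "qu":
            breaks.add(i + 1)            # before the qu
    return sorted(breaks)
```

Under these assumptions the sketch finds \.{rab-bit}, \.{bul-let},
\.{pock-et} and \.{se-quel}, and leaves ``seller'' unbroken.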

If the four-letter combination \.{XYZW} is found, where \.X and \.W are vowels and
\.Y and \.Z are consonants, break between the consonants unless \.{YZ} is one of
the following pairs:
$$\vjust{\halign{\hfill\tt#\hfill\cr
bl, br, cl, cr, chl, chr, dg, dr, fl, fr, ght, gl, gr, kn, lk, lq,\cr
nch, nk, nx, phr, pl, pr, rk, sp, sq, tch, tr, thr, wh, wl, wn, wr.\cr}}$$
Furthermore do not break between the consonants if the word ends with
\.{XYZer}, \.{XYZers}, \.{XYZage}, or \.{XYZages}, when \.{YZ} is one of the pairs
$$\.{ft, ld, mp, nd, ng, ns, nt, rg, rm, rn, rt, st.}$$
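
The four-letter test, including the combined consonants and both lists of
pairs, might be sketched as follows. The names tokenize and
consonant_pair_breaks are hypothetical; doubled consonants are skipped here on
the assumption that the \.{XYY} test has already dealt with them.

```python
VOWELS = set("aeiouy")
DIGRAPHS = {"ch", "gh", "ph", "sh", "th"}
NO_BREAK = set("bl br cl cr chl chr dg dr fl fr ght gl gr kn lk lq "
               "nch nk nx phr pl pr rk sp sq tch tr thr wh wl wn wr".split())
END_PAIRS = set("ft ld mp nd ng ns nt rg rm rn rt st".split())
ENDINGS = {"er", "ers", "age", "ages"}

def tokenize(word):
    """Split a word into letters, keeping ch, gh, ph, sh, th as single tokens."""
    tokens, i = [], 0
    while i < len(word):
        step = 2 if word[i:i + 2] in DIGRAPHS else 1
        tokens.append(word[i:i + step])
        i += step
    return tokens

def consonant_pair_breaks(word):
    """Return the break offsets found by the XYZW test."""
    tokens = tokenize(word)
    offsets, pos = [], 0
    for t in tokens:                     # letter offset at which each token starts
        offsets.append(pos)
        pos += len(t)
    breaks = []
    for i in range(len(tokens) - 3):
        x, y, z, w = tokens[i:i + 4]
        if x not in VOWELS or w not in VOWELS or y in VOWELS or z in VOWELS:
            continue
        if y == z or y + z in NO_BREAK:
            continue
        if y + z in END_PAIRS and word[offsets[i + 3]:] in ENDINGS:
            continue                     # word ends in XYZer, XYZers, XYZage, XYZages
        breaks.append(offsets[i + 2])    # between Y and Z
    return breaks
```

With these tables the sketch yields \.{win-dow}, \.{doc-tor}, \.{num-ber}
and \.{dol-phin}, while ``secret'', ``kitchen'' and ``painter'' receive no
break from this test.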

\yskip\textindent{\sl 5)}{\sl Retaining short ends.}\xskip
After applying rules 1 thru 4, take back all ``permissible'' breaks that
result in only one or two letters after the break, or that have only one
letter before it, or that have only one letter between prefix and suffix.
(Thus, for example, the suffix rule will break \.{-ly}, but this won't
count in the final analysis; it does affect the hyphenation algorithm, however, 
since the suffixes in words like ``\.{rationally}'' will be found by repeated 
suffix removal.)
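
The first two conditions of this rule amount to a filter on break offsets, as
in the sketch below. The name drop_short_ends is hypothetical, and the third
condition (a single letter between prefix and suffix) is omitted, since it
needs to know where the prefix and suffix boundaries lie.

```python
def drop_short_ends(word, breaks):
    """Keep a break only if at least two letters precede it and three follow it."""
    return [b for b in breaks if b >= 2 and len(word) - b >= 3]
```

Applied to ``rationally'' with the breaks \.{ra-tion-al-ly}, the filter keeps
the first two breaks and takes back the one before \.{ly}.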

Also, take back any break leading to the syllable \.{-e}, \.{-xe}, or \.{-xye},
where
\.x and \.y are any two letters and where this \.e occurs at the end of the shortest
subword on which suffix removal was tried in rule 2. (This rule avoids syllables
with ``silent e''. For example, we do not wish to hyphenate \.{rid-dle},
\.{proces-ses},
\.{was-teful}, \.{arran-gement}, \.{themsel-ves}, \.{lar-gely}, and so on.)

\yyskip {\bf Example of hyphenation:}$$\.{su-per-califragilis-ticex-pialido-cious.}
$$(This is a correct subset of the ``official'' syllabification specified
by the coiners of this word, namely
{\tt su-per-cal-i-frag-il-is-tic-ex-pi-al-i-do-}etc.)

\yskip Now here's the dictionary of words that should be handled separately,
as mentioned in rule 1.
(When an asterisk appears, it means that this letter is checked too, in addition
to the first seven letters.)

First, we include the following words since they are exceptions to the
suffix rules:
$$\def\\{\noalign{\penalty-200}}\halign{\tt# \hfill⊗\tt#\hfill\cr
(-able)⊗con-trol-lable eq-uable in-sa-tiable ne-go-tiable\cr
⊗so-ciable turn-table un-con-trollable un-so-ciable\cr
(-dent)⊗de-pend-ent in-de-pend-ent\cr
\\(-ing)⊗any-thing bal-ding dar-ling dump-ling err-ing eve-ning\cr
⊗every-thing far-thing found-ling ink-ling main-spring\cr
⊗nest-ling off-spring play-thing sap-ling shoe-string\cr
⊗sib-ling some-thing star-ling ster-ling un-err-ing\cr
⊗up-swing weak-ling year-ling\cr
\\(-ize)⊗civ-i-lize crys-tal-lize im-mo-bi-lize me-ta-bo-lize\cr
⊗mo-bi-lize mo-nop-o-lize sta-bi-li*ze tan-ta-lize\cr
⊗un-civ-i-lized\cr
\\(-late)⊗pal-ate\cr
\\(-ment)⊗in-clem-ent\cr
\\(-ness)⊗bar-on-ess li-on-ess\cr
\\(-ogy)⊗eu-logy ped-a-gogy \cr
\\(-scious)⊗lus-cious\cr
\\(-sphere)⊗at-mos-phere\cr
\\(-tal)⊗met-al non-metal pet-al post-al rent-al\cr
(-tion)⊗cat-ion\cr
(-tive)⊗com-bat-ive\cr
(-ture)⊗stat-ure\cr}$$

Exceptions to the prefix rules:
$$\halign{\tt# \hfill⊗\tt#\hfill\cr
(be-)⊗beck-on bes-tial\cr
(com-)⊗com-a-tose come-back co-me-dian comp-troller\cr
(con-)⊗cone-flower co-nun-drum\cr
(equi-)⊗equipped\cr
(hand-)⊗handle-bar\cr
(in-)⊗inch-worm ink-blot inn-keeper\cr
(inter-)⊗in-te-rior\cr
(mini-)⊗min-is-ter min-is-try\cr
(non-)⊗none-the-less\cr
(quad-)⊗qua-drille\cr
(some-)⊗som-er-sault\cr
(super-)⊗su-pe-rior\cr
(un-)⊗u-na-nim-ity u-nan-i-mous unc-tuous\cr}$$

Exceptions to the consonant rules:
$$\def\\{\noalign{\penalty-200\vskip.5pt plus .5pt}}\halign{\tt# \hfill⊗\tt
#\hfill\cr
bt:⊗debt-or \cr
\\ck:⊗ac-know-ledge\cr
\\ct:⊗de-duct-i*ble ex-act-i-tude in-ex-act-i-tude\cr
⊗pre-dict-*able re-spect-*able un-pre-dict-able vict-ual\cr
\\dl:⊗needle-work idler\cr
\\ff:⊗buff-er off-beat off-hand off-print off-shoot off-shore stiff-en\cr
\\ft:⊗left-ist left-over lift-off\cr
\\fth:⊗soft-hearted\cr
\\gg:⊗egg-nog egg-head\cr
\\gn:⊗cognac for-eign-er vignette\cr
\\gsh:⊗hogs-head\cr
\\ld:⊗child-ish eld-est hold-out hold-over hold-up\cr
\\lf:⊗self-ish\cr
\\ll:⊗bull-ish crest-fallen dis-till-*ery fall-out lull-aby\cr
⊗roll-away sell-out wall-eye\cr
\\lm:⊗psalm-ist\cr
\\ls:⊗else-where false-hood\cr
\\lt:⊗con-sult-ant volt-age\cr
\\lv:⊗re-solv-able re-volv-er solv-able un-solv-able\cr
\\mb:⊗beach-comber bomb-er climb-er plumb-er\cr
\\mp:⊗damp-en damp-est\cr
\\nch:⊗clinch-er launch-er lunch-eon ranch-er trench-ant\cr
\\nc:⊗an-nouncer bouncer fencer hence-forth mince-meat si-lencer\cr
\\nd:⊗bind-ery bound-ary com-mend-*a-*t*ory de-pend-able\cr
⊗ex-pend-able fiend-ish land-owner out-land-ish round-about\cr
⊗send-off stand-out\cr
\\ng:⊗change-over hang-out hang-over ha-rangue me-ringue\cr
⊗orange-ade tongue venge-ance\cr
\\ns:⊗sense-less\cr
\\nt:⊗ac-count-ant ant-acid ant-eater count-ess rep-re-sentative\cr
\\nth:⊗ant-hill pent-house per-cent-*age\cr
\\pt:⊗ac-cept-able ac-ceptor adapt-able adapt-er crypt-analysis\cr
⊗in-ter-ru*p*t-*i*ble\cr
\\qu:⊗an-tiq-uity ineq-uity iniq-uity liq-uefy liq-uid liq-ui-date\cr
⊗liq-ui-da-tion liq-uor pre-req-ui-site  req-ui-si-tion\cr
⊗u-biq-ui-tous\cr
\\rb:⊗ab-sorb-ent carb-on herbal im-per-turb-able\cr
\\rch:⊗arch-ery arch-an-gel re-search-er un-search-able\cr
\\rd:⊗ac-cord-ance board-er chordal hard-en hard-est haz-ard-ous\cr
⊗jeop-ard-ize re-corder stand-ard-ize stew-ard-ess yard-age\cr
\\rf:⊗surf-er\cr
\\rg:⊗morgue\cr
\\rl:⊗curl-i-que\cr
\\rm:⊗af-firm-a-*t*i*ve con-form-*ity de-form-ity in-form-a*nt\cr
⊗non-con-form-ist\cr
\\rn:⊗cav-ern-ous dis-cern-ible mod-ern-ize turn-about turn-over\cr
⊗un-gov-ern-able west-ern-ize\cr
\\rp:⊗harp-ist sharpen\cr
\\rq:⊗torque\cr
\\rs:⊗coars-en ir-re-vers-ible nurse-maid nurs-ery\cr
⊗re-hears-al re-vers-ible wors-en\cr
\\rt:⊗art-ist con-vert-ible court-yard fore-short-en heart-ache\cr
⊗heart-ily short-en\cr
\\rth:⊗apart-heid court-house earth-en-ware north-east north-ern\cr
⊗port-hole\cr
\\rv:⊗nerv-ous ob-serv-a*ble ob-server pre-serv-*a-*t*i*ve serv-er\cr
⊗serv-ice-able\cr
\\sch:⊗pre-school\cr
\\sc:⊗con-de-scend cre-scendo de-cre-scendo de-scend-ent de-scent\cr
⊗pleb-i-scite re-scind sea-scape\cr
\\sk:⊗askance snake-skin whisk-er\cr
\\sl:⊗cole-slaw\cr
\\sn:⊗rattle-snake\cr
\\ss:⊗class-ify class-room cross-over dis-miss-al ex-press-ible\cr
⊗im-pass-able less-en pass-able toss-up un-class-i-fied\cr
\\st:⊗ar-mi-stice astig-ma-tism astir astonish-ment blast-off\cr
⊗by-stand-er candle-stick cast-away cast-off con-test-ant\cr
⊗co-star de-test-able di-gest-ible east-ern ex-ist-ence\cr
⊗fore-stall in-con-test-able in-di-ges*t-*i*ble\cr
⊗in-ex-haust-ible life-style lime-stone live-stock mile-stone\cr
⊗non-ex-ist-ent per-sist-ent pho-to-stat re-start-ed\cr
⊗re-state-ment re-store shy-ster side-step smoke-stack\cr
⊗sug-gest-*i*ble thermo-stat waste-bas-ket waste-land\cr
\\sth:⊗mast-head post-hu-mous priest-hood\cr
\\sw:⊗side-swipe\cr
\\tt:⊗watt-meter\cr
\\tw:⊗be-tween\cr
\\tz:⊗kib-itzer\cr
\\zz:⊗buzz-er\cr}$$

Of course, this is not a complete list of exceptions. But it does seem to
cover all words that have a reasonably high chance of being mis-hyphenated
in \TEX's output,
considering the fact that \TEX\ usually finds a good way to break a
paragraph without any hyphenation at all.

The following words have also been included in the special dictionary,
since they are common in the author's vocabulary, and since they need more
hyphens than \TEX\ would otherwise find:
$$\vcenter{\halign{\tt#\hfill\cr
al-go-rithm\cr
bib-li-og-raphy\cr
bi-no-mial\cr
cen-ter\cr
com-put-a-*bil-ity\cr
dec-la-ra-tion\cr
de-gree\cr}}\qquad\qquad
\vcenter{\halign{\tt#\hfill\cr
es-tab-lish\cr
gen-er-ator\cr
hap-hazard\cr
neg-li-gible\cr
pe-ri-odic\cr
poly-no-mial\cr
pre-vious\cr}}\qquad\qquad
\vcenter{\halign{\tt#\hfill\cr
prob-a-bil-ity\cr
prob-able\cr
pro-ce-dure\cr
pub-li-ca-tion\cr
pub-lish\cr
re-place-ment\cr
when-ever\cr}}$$

\vfill
\specialappbegin S. {Special notes about using \TEX\ at Stanford}
(1) The standard \TEX\ program that you get by typing ``{\tt r tex}''
requires that fonts {\tt @}, {\tt a}, {\tt d}, {\tt g}, {\tt j}, {\tt l},
{\tt n}, {\tt q}, {\tt u}, {\tt x}, {\tt z}, and {\tt ?} be reserved for
the fonts declared in Appendix B. (The reason is that the system program
already has the font information for these fonts in its memory; this avoids
making \TEX\ reload twelve separate font information files each time.)

\yskip\noindent
(2) The standard \TEX\ program produces output for the XGP. To produce output
for the Alphatype (when it is available) we will use another program ``{\tt
texa}''.

\yskip\noindent
(3) The extension ``{\tt.TEX}'' is assumed to apply to {\≡\input≡\} file
names if you do not specify the extension. If \TEX\ can't find the file
in your area, it tries system area {\tt[1,3]}
before giving up. (File {\tt basic.TEX} is on this area.)

\danger (4) If a font you are using isn't on area {\tt[XGP,SYS]}, you must
mention the area explicitly. \TEX\ ignores the extension on font file names;
the XGP server will assume that the extension is ``{\tt.FNT}'', and \TEX\
assumes that the font information is on another file with the extension
``{\tt.TFX}''. (This applies to XGP fonts only; Alphatype fonts will be on
area {\tt[ALP,SYS]}, and the corresponding \TEX\ font information will have
extension ``{\tt.TFA}''.)

\ddanger (5) Documentation for the \TEX\ processor appears in the file
{\tt TEXSYS.SAI} on area {\tt[TEX,DEK]}, and in several other files mentioned there.